Method for executing automatic evaluation of transmission quality of audio signals using source/received-signal spectral covariance

ABSTRACT

A source signal (e.g. a speech sample) is processed or transmitted by a speech coder 1 and converted into a reception signal (coded speech signal). The source and reception signals are separately subjected to preprocessing 2 and psychoacoustic modelling 3. This is followed by a distance calculation 4, which assesses the similarity of the signals. Lastly, an MOS calculation is carried out in order to obtain a result comparable with human evaluation. According to the invention, in order to assess the transmission quality a spectral similarity value is determined which is based on calculation of the covariance of the spectra of the source signal and reception signal and division of the covariance by the standard deviations of the two said spectra.
The method makes it possible to obtain an objective assessment (speech quality prediction) while taking the human auditory process into account.

This application is the national phase under 35 U.S.C. §371 of PCT International Application No. PCT/CH99/00269 which has an International filing date of Jun. 21, 1999, which designated the United States of America.

TECHNICAL FIELD

The invention relates to a method for making a machine-aided assessment of the transmission quality of audio signals, in particular of speech signals, spectra of a source signal to be transmitted and of a transmitted reception signal being determined in a frequency domain.

PRIOR ART

The assessment of the transmission quality of speech channels is gaining increasing importance with the growing proliferation and geographical coverage of mobile radio telephony. There is a desire for a method which is objective (i.e. not dependent on the judgment of a specific individual) and can run automatically.

Perfect transmission of speech via a telecommunications channel in the standardized 0.3-3.4 kHz frequency band gives about 98% sentence comprehension. However, the introduction of digital mobile radio networks with speech coders in the terminals can greatly impair the comprehensibility of speech. Moreover, determining the extent of the impairment presents certain difficulties.

Speech quality is a vague term compared, for example, with bit rate, echo or volume. Since customer satisfaction can be measured directly according to how well the speech is transmitted, coding methods need to be selected and optimized in relation to their speech quality. In order to assess a speech coding method, it is customary to carry out very elaborate auditory tests. The results are in this case far from reproducible and depend on the motivation of the test listeners. It is therefore desirable to have a hardware replacement which, by suitable physical measurements, measures the speech performance features which correlate as well as possible with subjectively obtained results (Mean Opinion Score, MOS).

EP 0 644 674 A2 discloses a method for assessing the transmission quality of a speech transmission path which makes it possible, at an automatic level, to obtain an assessment which correlates strongly with human perception. This means that the system can make an evaluation of the transmission quality and apply a scale as it would be used by a trained test listener. The key idea consists in using a neural network. The latter is trained using a speech sample. The end effect is that integral quality assessment takes place. The reasons for the loss of quality are not addressed.

Modern speech coding methods perform data compression and use very low bit rates. For this reason, simple known objective methods, such as for example the signal-to-noise ratio (SNR), fail.

SUMMARY OF THE INVENTION

The object of the invention is to provide a method of the type mentioned at the start, which makes it possible to obtain an objective assessment (speech quality prediction) while taking the human auditory process into account.

The way in which the object is achieved is defined by the features of claim 1. According to the invention, in order to assess the transmission quality a spectral similarity value is determined which is based on calculation of the covariance of the spectra of the source signal and reception signal and division of the covariance by the standard deviations of the two said spectra.

Tests with a range of graded speech samples and the associated auditory judgment (MOS) have shown that a very good correlation with the auditory values can be obtained on the basis of the method according to the invention. Compared with the known procedure based on a neural network, the present method has the following advantages:

Less demand on storage and CPU resources. This is important for real-time implementation.

No elaborate system training for using new speech samples.

No suboptimal reference inherent in the system. The best speech quality which can be measured using this measure corresponds to that of the speech sample.

Preferably, the spectral similarity value is weighted with a factor which, as a function of the ratio between the energies of the spectra of the reception and source signals, reduces the similarity value to a greater extent when the energy of the reception signal is greater than the energy of the source signal than when the energy of the reception signal is lower than that of the source signal. In this way, extra signal content in the reception signal is more negatively weighted than missing signal content.

According to a particularly preferred embodiment, the weighting factor is also dependent on the signal energy of the reception signal. For any ratio of the energies of the spectra of reception to source signal, the similarity value is reduced commensurately to a greater extent the higher the signal energy of the reception signal is. As a result, the effect of interference in the reception signal on the similarity value is controlled as a function of the energy of the reception signal. To that end, at least two level windows are defined, one below a predetermined threshold and one above this threshold. Preferably, a plurality of, in particular three, level windows are defined above the threshold. The similarity value is reduced according to the level window in which the reception signal lies. The higher the level, the greater the reduction.

The invention can in principle be used for any audio signals. If the audio signals contain inactive phases (as is typically the case with speech signals) it is recommendable to perform the quality evaluation separately for active and inactive phases. Signal segments whose energy exceeds the predetermined threshold are assigned to the active phase, and the other segments are classified as pauses (inactive phases). The spectral similarity described above is then calculated only for the active phases.

For the inactive phases (e.g. speech pauses) a quality function can be used which falls off degressively as a function of the pause energy: $A^{\frac{\log_{10}(E_{pa})}{\log_{10}(E_{\max})}}$

A is a suitably selected constant, and $E_{\max}$ is the greatest possible value of the pause energy.

The overall quality of the transmission (that is to say the actual transmission quality) is given by a weighted linear combination of the qualities of the active and of the inactive phases. The weighting factors depend in this case on the proportion of the total signal which the active phase represents, and specifically in a non-linear way which favours the active phase. With a proportion of e.g. 50%, the weight given to the quality of the active phase may be of the order of e.g. 90%.

Pauses or interference in the pauses are thus taken into account separately and to a lesser extent than active signal phases. This accounts for the fact that essentially no information is transmitted in pauses, but that it is nevertheless perceived as unpleasant if interference occurs in the pauses.

According to an especially preferred embodiment, the time-domain sampled values of the source and reception signals are combined in data frames which overlap one another by from a few milliseconds to a few dozen milliseconds (e.g. 16 ms). This overlap models, at least partially, the time masking inherent in the human auditory system.

A substantially realistic reproduction of the time masking is obtained if, in addition, after the transformation to the frequency domain, the spectrum of the current frame has the attenuated spectrum of the preceding one added to it. The spectral components are in this case preferably weighted differently. Low frequency components in the preceding frame are weighted more strongly than ones with higher frequency.

It is recommendable to carry out compression of the spectral components before performing the time masking, by exponentiating them with a value α<1 (e.g. α=0.3). This is because if a plurality of frequencies occur at the same time in a frequency band, an over-reaction takes place in the auditory system, i.e. the total volume is perceived as greater than that of the sum of the individual frequencies. The components are therefore compressed.

A further measure for obtaining a good correlation between the assessment results of the method according to the invention and subjective human perception consists in convoluting the spectrum of a frame with an asymmetric “smearing function”. This mathematical operation is applied both to the source signal and to the reception signal and before the similarity is determined.

The smearing function is, in a frequency/loudness diagram, preferably a triangle function whose left edge is steeper than its right edge.

Before the convolution, the spectra may additionally be expanded by exponentiation with a value ε>1 (e.g. ε=4/3). The loudness function characteristic of the human ear is thereby simulated.

The detailed description below and the set of patent claims will give further advantageous embodiments and combinations of features of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings used to explain the illustrative embodiment:

FIG. 1 is an outline block diagram to explain the principle of the processing;

FIG. 2 is a block diagram of the individual steps of the method for performing the quality assessment;

FIG. 3 shows an example of a Hamming window;

FIG. 4 shows a representation of the weighting function for calculating the frequency/tonality conversion;

FIG. 5 shows a representation of the frequency response of a telephone filter;

FIG. 6 shows a representation of the equal-volume curves for the two-dimensional sound field (Ln is the volume and N the loudness);

FIG. 7 shows a schematic representation of the time masking;

FIG. 8 shows a representation of the loudness function (sone) as a function of the sound level (phon) of a 1 kHz tone;

FIG. 9 shows a representation of the smearing function;

FIG. 10 shows a graphical representation of the speech coefficients in the form of a function of the proportion of speech in the source signal;

FIG. 11 shows a graphical representation of the quality in the pause phase in the form of a function of the speech energy in the pause phase;

FIG. 12 shows a graphical representation of the gain constant in the form of a function of the energy ratio; and

FIG. 13 shows a graphical representation of the weighting coefficients for implementing the time masking as a function of the frequency component.

In principle, the same parts are given the same reference numbers in the figures.

EMBODIMENTS OF THE INVENTION

A concrete illustrative embodiment will be explained in detail below with reference to the figures.

FIG. 1 shows the principle of the processing. A speech sample is used as the source signal x(i). It is processed or transmitted by the speech coder 1 and converted into a reception signal y(i) (coded speech signal). The said signals are in digital form. The sampling frequency is e.g. 8 kHz and the digital quantization 16 bit. The data format is preferably PCM (without compression).

The source and reception signals are separately subjected to preprocessing 2 and psychoacoustic modelling 3. This is followed by distance calculation 4, which assesses the similarity of the signals. Lastly, an MOS calculation 5 is carried out in order to obtain a result comparable with human evaluation.

FIG. 2 clarifies the procedures described in detail below. The source signal and the reception signal follow the same processing route. For the sake of simplicity, the process has only been drawn once. It is, however, clear that the two signals are dealt with separately until the distance measure is determined.

The source signal is based on a sentence which is selected in such a way that its phonetic frequency statistics correspond as well as possible to uttered speech. In order to prevent contextual hearing, meaningless syllables are used which are referred to as logatoms. The speech sample should have a speech level which is as constant as possible. The length of the speech sample is between 3 and 8 seconds (typically 5 seconds).

Signal conditioning: In a first step, the source signal is entered in the vector x(i) and the reception signal is entered in the vector y(i). The two signals need to be synchronized in terms of time and level. The DC component is then removed by subtracting the mean from each sample value: $x(i) = x(i) - \frac{1}{N}\sum_{k=1}^{N} x(k), \quad y(i) = y(i) - \frac{1}{N}\sum_{k=1}^{N} y(k) \qquad (1)$

The signals are furthermore normalized to common RMS (Root Mean Square) levels because the constant gain in the signal is not taken into account: $x(i) = x(i)\cdot\frac{1}{\sqrt{\frac{1}{N}\sum_{k=1}^{N} x(k)^{2}}}, \quad y(i) = y(i)\cdot\frac{1}{\sqrt{\frac{1}{N}\sum_{k=1}^{N} y(k)^{2}}} \qquad (2)$
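The two conditioning steps of formulas (1) and (2) can be sketched as follows. This is a minimal illustration only; the function names and the handling of an all-zero signal are assumptions and are not taken from the description.

```python
import math

def remove_dc(signal):
    """Subtract the mean from every sample, as in formula (1)."""
    mean = sum(signal) / len(signal)
    return [s - mean for s in signal]

def normalize_rms(signal):
    """Scale the signal to unit RMS level, as in formula (2)."""
    rms = math.sqrt(sum(s * s for s in signal) / len(signal))
    if rms == 0.0:
        return list(signal)  # assumption: an all-zero signal is left unchanged
    return [s / rms for s in signal]

def condition(signal):
    """Apply DC removal first, then RMS normalization."""
    return normalize_rms(remove_dc(signal))

x = condition([3.0, 5.0, 7.0, 5.0])
```

The same function would be applied separately to the source and the reception signal, so that both end up with zero mean and the same RMS level.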

The next step is to form the frames: both signals are divided into segments of 32 ms length (256 sample values at 8 kHz). These frames are the processing units in all the later processing steps. The frame overlap is preferably 50% (128 sample values).
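A possible way of forming the overlapping frames is sketched below. The function name is hypothetical, and dropping an incomplete frame at the end of the signal is an assumption, since the description does not say how the tail is handled.

```python
def split_into_frames(signal, frame_length=256, overlap=128):
    """Cut a signal into frames of frame_length samples with the given overlap."""
    hop = frame_length - overlap  # 128 samples = 16 ms at 8 kHz
    frames = []
    start = 0
    # Assumption: a trailing partial frame is discarded.
    while start + frame_length <= len(signal):
        frames.append(signal[start:start + frame_length])
        start += hop
    return frames

frames = split_into_frames(list(range(1024)))
```

With 1024 samples and the default values this yields 7 frames, each starting 128 samples after the previous one.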

This is followed by the Hamming windowing 6 (cf. FIG. 2). In a first processing step, the frame is subjected to time weighting. A so-called Hamming window (FIG. 3) is generated, by which the signal values of a frame are multiplied. $\mathrm{hamm}(k) = 0.54 - 0.46\cdot\cos\left(\frac{2\pi(k-1)}{255}\right), \quad 1 \leq k \leq 255 \qquad (3)$

The purpose of the windowing is to convert a temporally unlimited signal into a temporally limited signal through multiplying the temporally unlimited signal by a window function which vanishes (is equal to zero) outside a particular range.

$x(i) = x(i)\cdot\mathrm{hamm}(i), \quad y(i) = y(i)\cdot\mathrm{hamm}(i), \quad 1 \leq i \leq 255 \qquad (4)$
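The windowing of formulas (3) and (4) might be implemented as in the following sketch. It assumes that the window spans all 256 sample values of a frame, so that k runs from 1 to 256 with the denominator 255; the function names are illustrative.

```python
import math

FRAME_LENGTH = 256

def hamming_window(length=FRAME_LENGTH):
    """Hamming window following formula (3), with denominator length - 1."""
    # Assumption: k covers every sample of the frame (1..length).
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * (k - 1) / (length - 1))
            for k in range(1, length + 1)]

def apply_window(frame, window):
    """Multiply each sample by the matching window value, as in formula (4)."""
    return [s * w for s, w in zip(frame, window)]

window = hamming_window()
windowed = apply_window([1.0] * FRAME_LENGTH, window)
```

The window starts and ends at 0.08 and is symmetric about the middle of the frame.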

The source signal x(i) in the time domain is now converted into the frequency domain by means of a discrete Fourier transform (FIG. 2: DFT 7). For a temporally discrete value sequence x(n) with n=0, 1, 2, ..., N−1, which has been created by the windowing, the complex Fourier transform $c_x(j)$ of the source signal when the period is N is as follows, where $\mathrm{i}=\sqrt{-1}$ is the imaginary unit, n the sample index and j the frequency index: $c_x(j) = \sum_{n=0}^{N-1} x(n)\cdot\exp\left(-\mathrm{i}\cdot\frac{2\pi}{N}\cdot n\cdot j\right), \quad 0 \leq j \leq N-1 \qquad (5)$

The same is done for the coded signal, or reception signal y(n): $c_y(j) = \sum_{n=0}^{N-1} y(n)\cdot\exp\left(-\mathrm{i}\cdot\frac{2\pi}{N}\cdot n\cdot j\right), \quad 0 \leq j \leq N-1 \qquad (6)$

In the next step, the magnitude of the spectrum is calculated (FIG. 2: taking the magnitude 8). The index x always denotes the source signal and y the reception signal:

$Px_j = \sqrt{c_x(j)\cdot\mathrm{conjg}(c_x(j))}, \quad Py_j = \sqrt{c_y(j)\cdot\mathrm{conjg}(c_y(j))} \qquad (7)$
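The transform and the magnitude step of formulas (5) to (7) can be illustrated with a direct, unoptimized DFT. This is a sketch for clarity only; a practical implementation would normally use an FFT routine, and the function names are assumptions.

```python
import cmath
import math

def dft(frame):
    """Direct discrete Fourier transform of one windowed frame."""
    n_points = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * cmath.pi * n * j / n_points)
                for n in range(n_points))
            for j in range(n_points)]

def magnitude(spectrum):
    """Magnitude as the square root of c times its complex conjugate."""
    return [math.sqrt((c * c.conjugate()).real) for c in spectrum]

# Test tone with exactly 4 periods in a 32-sample frame.
tone = [math.cos(2.0 * math.pi * 4 * n / 32) for n in range(32)]
mags = magnitude(dft(tone))
```

For this test tone the energy is concentrated in frequency index 4 and its mirror image at index 28.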

Division into the critical frequency bands is then carried out (FIG. 2: Bark transformation 9).

In this case, an adapted model by E. Zwicker, Psychoakustik, 1982, is used. The basilar membrane in the human ear divides the frequency spectrum into critical frequency groups. These frequency groups play an important role in the perception of loudness. At low frequencies, the frequency groups have a constant bandwidth of 100 Hz, and at frequencies above 500 Hz it increases proportionately with frequency (it is equal to about 20% of the respective mid-frequency). This corresponds approximately to the properties of human hearing, which also processes the signals in frequency bands, although these bands are variable, i.e. their mid-frequency is dictated by the respective sound event.

The table below shows the relationship between tonality z, frequency f, frequency group width ΔF and FFT index. The FFT indices correspond to the FFT resolution, 256. Only the 100-4000 Hz bandwidth is of interest for the subsequent calculation.

Z [Bark]   F(low) [Hz]   ΔF [Hz]   FFT Index
0          0             100
1          100           100       3
2          200           100       6
3          300           100       9
4          400           100       13
5          510           110       16
6          630           120       20
7          770           140       25
8          920           150       29
9          1080          160       35
10         1270          190       41
11         1480          210       47
12         1720          240       55
13         2000          280       65
14         2320          320       74
15         2700          380       86
16         3150          450       101
17         3700          550       118
18         4400          700
19         5300          900
20         6400          1100
21         7700          1300
22         9500          1800
23         12000         2500
24         15500         3500

The window applied here represents a simplification. All frequency groups have a width ΔZ(z) of 1 Bark. The tonality scale z in Bark is calculated according to the following formula: $Z = 13\cdot\arctan(0.76\cdot f) + 3.5\cdot\arctan\left[\left(\frac{f}{7.5}\right)^{2}\right], \qquad (8)$

with f in [kHz] and Z in [Bark].
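Formula (8) translates directly into a small helper. The function name and the choice of Hertz as the input unit are assumptions made for this sketch; the formula itself expects kHz.

```python
import math

def hz_to_bark(frequency_hz):
    """Tonality in Bark according to formula (8); input in Hz."""
    f_khz = frequency_hz / 1000.0  # the formula expects f in kHz
    return 13.0 * math.atan(0.76 * f_khz) + 3.5 * math.atan((f_khz / 7.5) ** 2)

z_1khz = hz_to_bark(1000.0)
```

A 1 kHz tone comes out at roughly 8.5 Bark and a 100 Hz tone at roughly 1 Bark, which agrees with the order of magnitude of the table entries.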

A tonality difference of one Bark corresponds approximately to a 1.3 millimetre section on the basilar membrane (150 hair cells). The actual frequency/tonality conversion can be performed simply according to the following formula: $Px_i'[j] = \frac{1}{\Delta f_j}\cdot\sum_{k=I_f[j]}^{I_l[j]} q(f)\cdot Px_i[k], \quad Py_i'[j] = \frac{1}{\Delta f_j}\cdot\sum_{k=I_f[j]}^{I_l[j]} q(f)\cdot Py_i[k], \qquad (9)$

$I_f[j]$ being the index of the first sample on the Hertz scale for band j and $I_l[j]$ that of the last sample. $\Delta f_j$ denotes the bandwidth of band j in Hertz. q(f) is the weighting function (FIG. 4). Since the discrete Fourier transform only gives values of the spectrum at discrete points (frequencies), the band limits each lie on such a frequency. The values at the band limits are only given half weighting in each of the neighbouring windows. The band limits are at N*8000/256 Hz.

N = 3, 6, 9, 13, 16, 20, 25, 29, 35, 41, 47, 55, 65, 74, 86, 101, 118

For the 0.3-3.4 kHz telephony bandwidth, 17 values on the tonality scale are used, which then correspond to the input. Of the resulting 128 FFT values, the first 2, which correspond to the frequency range 0 Hz to 94 Hz, and the last 10, which correspond to the frequency range 3700 Hz to 4000 Hz, are omitted.
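The band summation with half weighting at the band limits could look roughly like the sketch below. Several points are assumptions: the weighting function q(f) is only shown in a figure, so a constant 1 is used as a placeholder; FFT index k is taken to correspond to k·8000/256 Hz; and the sketch simply processes the intervals between consecutive listed limits, without settling how the description arrives at its count of 17 values.

```python
BAND_LIMITS = [3, 6, 9, 13, 16, 20, 25, 29, 35, 41, 47, 55, 65, 74, 86, 101, 118]
BIN_WIDTH_HZ = 8000.0 / 256.0  # 31.25 Hz per FFT index

def bark_band_values(magnitudes, limits=BAND_LIMITS, q=lambda k: 1.0):
    """Weighted sum of the magnitude spectrum between consecutive band limits."""
    values = []
    for first, last in zip(limits[:-1], limits[1:]):
        total = 0.0
        for k in range(first, last + 1):
            # A value lying on a band limit counts half in each neighbouring band.
            edge_weight = 0.5 if k in (first, last) else 1.0
            total += edge_weight * q(k) * magnitudes[k]
        bandwidth_hz = (last - first) * BIN_WIDTH_HZ
        values.append(total / bandwidth_hz)
    return values

flat = bark_band_values([1.0] * 129)
```

For a flat spectrum every band yields the same value, which shows that the half weighting at the limits and the division by the bandwidth cancel the differing band widths.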

Both signals are then filtered with a filter whose frequency response corresponds to the reception curve of the corresponding telephone set (FIG. 2: telephone band filtering 10):

$Pfx_i[j] = \mathrm{Filt}[j]\cdot Px_i'[j], \quad Pfy_i[j] = \mathrm{Filt}[j]\cdot Py_i'[j] \qquad (10)$

where Filt[j] is the frequency response in band j of the frequency characteristic of the telephone set (defined according to ITU-T Recommendation P.830, Annex D).

FIG. 5 graphically represents the (logarithmic) values of such a filter.

The phon curves may also optionally be calculated (FIG. 2: phon curve calculation 11). In relation to this:

The volume of any sound is defined as that level of a 1 kHz tone which, with frontal incidence on the test individual in a plane wave, causes the same volume perception as the sound to be measured (cf. E. Zwicker, Psychoakustik, 1982). Curves of equal volume for different frequencies are thus referred to. These curves are represented in FIG. 6.

In FIG. 6 it can be seen, for example, that a 100 Hz tone at a volume level of 3 phon has a sound level of 25 dB. However, for a volume level of 40 phon, the same tone has a sound level of 50 dB. It can also be seen that, e.g. for a 100 Hz tone, the sound level must be 30 dB louder than for a 4 kHz tone in order for both to be able to generate the same loudness in the ear. An approximation is obtained in the model according to the invention through multiplying the signals Px and Py by a complementary function.

Since human hearing overreacts when a plurality of spectral components in one band occur at the same time, i.e. the total volume is perceived as greater than the linear sum of the individual volumes, the individual spectral components are compressed. The compressed specific loudness has the unit 1 sone. In order to perform the phon/sone transformation 12 (cf. FIG. 2), in the present case the input in Bark is compressed with an exponent α=0.3:

$Px_i'[j] = (Pfx_i[j])^{\alpha}, \quad Py_i'[j] = (Pfy_i[j])^{\alpha} \qquad (11)$

One important aspect of the preferred illustrative embodiment is the modelling of time masking.

The human ear is incapable of discriminating between two short test sounds which arrive in close succession. FIG. 7 shows the time-dependent processes. A masker of 200 ms duration masks a short tone pulse. The time where the masker starts is denoted 0. The time is negative to the left. The second time scale starts where the masker ends. Three time ranges are shown. Premasking takes place before the masker is turned on. Immediately after this is the simultaneous masking and after the end of the masker is the post-masking phase. There is a logical explanation for the post-masking (reverberation). The premasking takes place even before the masker is turned on. Auditory perception does not occur straight away. Processing time is needed in order to generate the perception. A loud sound is given fast processing, and a soft sound at the threshold of hearing a longer processing time. The premasking lasts about 20 ms and the post-masking 100 ms. The post-masking is therefore the dominant effect. The post-masking depends on the masker duration and the spectrum of the masking sound.

A rough approximation to time masking is obtained just by the frame overlap in the signal preprocessing. For a 32 ms frame length (256 sample values and 8 kHz sampling frequency) the overlap time is 16 ms (50%). This is sufficient for medium and high frequencies. For low frequencies this masking is much longer (>120 ms). This is then implemented as addition of the attenuated spectrum of the preceding frame (FIG. 2: time masking 15). The attenuation is in this case different in each frequency band: $Px_i''[j] = \frac{Px_i'[j] + Px_{i-1}'[j]\cdot\mathrm{coeff}(j)}{1 + \mathrm{coeff}(j)}, \quad Py_i''[j] = \frac{Py_i'[j] + Py_{i-1}'[j]\cdot\mathrm{coeff}(j)}{1 + \mathrm{coeff}(j)} \qquad (12)$

where coeff(j) are the weighting coefficients, which are calculated according to the following formula: $\mathrm{coeff}(j) = \exp\left(-\frac{\mathrm{FrameLength}/(2\cdot Fc)}{\left((2\cdot\mathrm{NoOfBarks}+1) - 2\cdot(j-1)\right)\cdot\eta}\right), \quad j = 1, 2, 3, \ldots, \mathrm{NoOfBarks} \qquad (13)$

where FrameLength is the length of the frame in sample values (e.g. 256), NoOfBarks is the number of Bark values within a frame (here e.g. 17), Fc is the sampling frequency and η=0.001.

The weighting coefficients for implementing the time masking as a function of the frequency component are represented by way of example in FIG. 13. It can clearly be seen that the weighting coefficients decrease with increasing Bark index (i.e. with rising frequency).

Time masking is only provided here in the form of post-masking. The premasking is negligible in this context.
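The post-masking step can be sketched as follows. The sketch reads formula (13) as the frame hop FrameLength/(2·Fc) divided by a band-dependent time constant ((2·NoOfBarks+1) − 2·(j−1))·η. That reading is an assumption, chosen because it makes the coefficients fall with rising Bark index as the description states. The function names are illustrative.

```python
import math

def masking_coefficients(frame_length=256, sampling_rate=8000.0,
                         no_of_barks=17, eta=0.001):
    """Weighting coefficients coeff(j), j = 1..no_of_barks, per formula (13)."""
    hop_seconds = frame_length / (2.0 * sampling_rate)  # 16 ms for the defaults
    coeffs = []
    for j in range(1, no_of_barks + 1):
        # Assumed reading: time constant of 35 ms for j=1 down to 3 ms for j=17.
        time_constant = ((2 * no_of_barks + 1) - 2 * (j - 1)) * eta
        coeffs.append(math.exp(-hop_seconds / time_constant))
    return coeffs

def apply_time_masking(current, previous, coeffs):
    """Add the attenuated previous frame to the current one, as in formula (12)."""
    return [(c + p * w) / (1.0 + w) for c, p, w in zip(current, previous, coeffs)]

coeffs = masking_coefficients()
```

Under these assumptions the coefficient is about 0.63 in the lowest band and below 0.01 in the highest, so the preceding frame contributes mainly at low frequencies.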

In a further processing phase, the spectra of the signals are “smeared” (FIG. 2: frequency smearing 13). The background for this is that the human ear is incapable of clearly discriminating two frequency components which are next to one another. The degree of frequency smearing depends on the frequencies in question, their amplitudes and other factors.

The reception variable of the ear is loudness. It indicates how much a sound to be measured is louder or softer than a standard sound. The reception variable found in this way is referred to as ratio loudness. The sound level of a 1 kHz tone has proved useful as standard sound. The loudness 1 sone has been assigned to the 1 kHz tone with a level of 40 dB. In E. Zwicker, Psychoakustik, 1982, the following definition of the loudness function is described: $\mathrm{Loudness} = 2^{\frac{L_{1\,\mathrm{kHz}} - 40}{10}}$, with $L_{1\,\mathrm{kHz}}$ in [dB].

FIG. 8 shows a loudness function (sone) for the 1 kHz tone as a function of the sound level (phon).
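The loudness definition can be checked numerically with a short sketch. It assumes the exponent is (L − 40)/10, which is the form consistent with the statement that a 1 kHz tone at 40 dB has a loudness of 1 sone; the function name is hypothetical.

```python
def loudness_sone(level_phon):
    """Loudness in sone for a 1 kHz tone at the given level."""
    # Assumed exponent: (L - 40) / 10, so that 40 phon maps to 1 sone.
    return 2.0 ** ((level_phon - 40.0) / 10.0)

reference = loudness_sone(40.0)
```

Each increase of 10 phon doubles the loudness, for example 2 sone at 50 phon and 0.5 sone at 30 phon.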

In the scope of the present illustrative embodiment, this loudness function is approximated as follows:

$Px_i'''[j] = (Px_i''[j])^{\varepsilon}, \quad Py_i'''[j] = (Py_i''[j])^{\varepsilon} \qquad (14)$

where ε=4/3.

The spectrum is expanded at this point (FIG. 2: loudness function conversion 14).

The spectrum as it now exists is convoluted with a discrete sequence of factors (convolution). The result corresponds to smearing of the spectrum over the frequency axis. Convolution of two sequences x and y can be computed either directly in the time domain, which is relatively complicated, or by multiplying their Fourier transforms. In the time domain, the formula is: $c = \mathrm{conv}(x, y), \quad c(k) = \sum_{j=\max(1,\,k+1-n)}^{\min(k,\,m)} x(j)\cdot y(k+1-j), \qquad (15)$

m being the length of sequence x and n the length of sequence y. The result c has length m+n−1, so that k=1, ..., m+n−1.

In the frequency domain:

$\mathrm{conv}(x,y) = \mathrm{FFT}^{-1}(\mathrm{FFT}(x)\cdot\mathrm{FFT}(y)). \qquad (16)$

x is replaced in the present example by the signal Px′″ and Py′″ with length 17 (m=17) and y is replaced by the smearing function Λ with length 9 (n=9). The result therefore has the length 17+9−1=25 (k=25).

$Ex_i = \mathrm{conv}(Px_i''', \Lambda(f)), \quad Ey_i = \mathrm{conv}(Py_i''', \Lambda(f)) \qquad (17)$

Λ(·) is the smearing function whose form is shown in FIG. 9. It is asymmetric. The left edge rises from a loudness of −30 at frequency component 1 to a loudness of 0 at frequency component 4. It then falls off again in a straight line to a loudness of −30 at frequency component 9. The smearing function is thus an asymmetric triangle function.
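The smearing step can be sketched as a direct linear convolution of a 17-value spectrum with a 9-value triangle. Two things are assumptions here: that the plotted triangle values are in dB, and that they are converted to linear factors by 10^(dB/10). The description gives only the shape of the function, and the function names are illustrative.

```python
def convolve(x, y):
    """Full linear convolution; the result has len(x) + len(y) - 1 entries."""
    m, n = len(x), len(y)
    result = [0.0] * (m + n - 1)
    for a in range(m):
        for b in range(n):
            result[a + b] += x[a] * y[b]
    return result

def smearing_function_db():
    """Triangle as described: -30 at component 1, 0 at 4, -30 at 9."""
    rising = [-30.0 + 10.0 * i for i in range(4)]  # components 1..4
    falling = [-6.0 * i for i in range(1, 6)]      # components 5..9
    return rising + falling

# Assumption: the values are dB and map to linear factors via 10^(dB/10).
smear = [10.0 ** (v / 10.0) for v in smearing_function_db()]

spectrum = [0.0] * 17
spectrum[8] = 1.0  # a single spectral component in the middle band
smeared = convolve(spectrum, smear)
```

A single component is spread over its neighbours, more strongly towards higher bands than towards lower ones, because the left edge of the triangle is steeper than the right edge.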

The psychoacoustic modelling 3 (cf. FIG. 1) is thus concluded. The quality calculation follows.

The distance between the weighted spectra of the source signal and of the reception signal is calculated as follows:

$Q_{TOT} = \eta_{sp}\cdot Q_{sp} + \eta_{pa}\cdot Q_{pa}, \quad \eta_{sp} + \eta_{pa} = 1 \qquad (18)$

where $Q_{sp}$ is the distance during the speech phase (active signal phase) and $Q_{pa}$ the distance in the pause phase (inactive signal phase). $\eta_{sp}$ is the speech coefficient and $\eta_{pa}$ is the pause coefficient.

The signal analysis of the source signal is firstly carried out with the aim of finding signal sequences where the speech is active. A so-called energy profile $En_{profile}$ is thus formed according to: $En_{profile}(i) = \begin{cases} 1, & \text{if } x(i) \geq \mathrm{SPEECH\_THR} \\ 0, & \text{if } x(i) < \mathrm{SPEECH\_THR} \end{cases}$

SPEECH_THR is used to define the threshold value below which speech is inactive. It usually lies at +10 dB to the maximum dynamic response of the AD converter. With 16 bit resolution, SPEECH_THR=−96.3+10=−86.3 dB. In PACE, SPEECH_THR=−80 dB.
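The thresholding can be illustrated with the sketch below. It assumes that the level is measured per frame as an RMS value in dB relative to a full scale of 32768. The description does not fix the measurement interval or the reference, so both are assumptions, as are the function names. The −96.3 dB figure follows from 16 bit resolution as 20·log10(2^−16).

```python
import math

def level_db(frame, full_scale=32768.0):
    """RMS level of a frame in dB relative to full scale (assumed reference)."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    if rms == 0.0:
        return float("-inf")
    return 20.0 * math.log10(rms / full_scale)

def energy_profile(frames, speech_thr_db=-86.3):
    """1 for frames at or above the threshold (speech active), 0 otherwise."""
    return [1 if level_db(f) >= speech_thr_db else 0 for f in frames]

floor_db = 20.0 * math.log10(2.0 ** -16)  # dynamic range of 16 bit resolution
profile = energy_profile([[0.0] * 4, [1000.0] * 4])
```

A silent frame is classified as a pause and a frame with amplitude 1000 as active speech.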

The quality is indirectly proportional to the similarity $Q_{TOT}$ between the source and reception signals. $Q_{TOT}=1$ means that the source and reception signals are exactly the same. For $Q_{TOT}=0$ these two signals have scarcely any similarities. The speech coefficient $\eta_{sp}$ is calculated according to the following formula: $\eta_{sp} = -\mu\cdot\left(\frac{\mu - 1}{\mu}\right)^{P_{sp}} + \mu, \quad 0 \leq P_{sp} \leq 1 \qquad (19)$

where μ=1.01 and $P_{sp}$ is the speech proportion.

As shown in FIG. 10, the effect of the speech sequence is greater (speech coefficient greater) if the speech proportion is greater. For example, at μ=1.01 and $P_{sp}=0.5$ (50%), this coefficient $\eta_{sp}=0.91$. The effect of the speech sequence in the signal is thus 91% and that of the pause sequence only 9% (100−91). At μ=1.07 the effect of the speech sequence is smaller (80%).

The pause coefficient is then calculated according to:

$\eta_{pa} = 1 - \eta_{sp} \qquad (20)$
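The weighting of formulas (18) to (20) can be reproduced with a few lines. The function names are illustrative; the numerical values follow the examples given in the text.

```python
def speech_coefficient(speech_proportion, mu=1.01):
    """Speech coefficient eta_sp according to formula (19)."""
    return -mu * ((mu - 1.0) / mu) ** speech_proportion + mu

def total_quality(q_speech, q_pause, speech_proportion, mu=1.01):
    """Weighted combination of speech and pause quality, formulas (18) and (20)."""
    eta_sp = speech_coefficient(speech_proportion, mu)
    eta_pa = 1.0 - eta_sp
    return eta_sp * q_speech + eta_pa * q_pause

eta_half = speech_coefficient(0.5)
```

For a speech proportion of 50% this gives a speech coefficient of about 0.91 with μ=1.01 and about 0.80 with μ=1.07, as stated above.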

The quality in the pause phase is not calculated in the same way as the quality in the speech phase.

$Q_{pa}$ is the function describing the signal energy in the pause phase. When this energy increases, the value $Q_{pa}$ becomes smaller (which corresponds to the deterioration in quality): $Q_{pa} = -k_n\cdot\left(\frac{k_n + 1}{k_n}\right)^{\frac{\log_{10}(E_{pa})}{\log_{10}(E_{\max})}} + k_n + 1 + m \qquad (21)$

$k_n$ is a predefined constant and here has the value 0.01. $E_{pa}$ is the RMS signal energy in the pause phase for the reception signal. Only when this energy is greater than the RMS signal energy of the pause phase in the source signal does it have an effect on the $Q_{pa}$ value. Thus, $E_{pa} = \max(Eref_{pa}, E_{pa})$. The smallest $E_{pa}$ is 2. $E_{\max}$ is the maximum RMS signal energy for given digital resolution (for 16 bit resolution, $E_{\max}=32768$). The value m in formula (21) is the correction factor for $E_{pa}=2$, so that then $Q_{pa}=1$. This correction factor is calculated thus: $m = k_n\cdot\left(\frac{k_n + 1}{k_n}\right)^{\frac{\log_{10}(E_{\min})}{\log_{10}(E_{\max})}} - k_n \qquad (22)$

For $E_{\max}=32768$, $E_{\min}=2$ and $k_n=0.01$ the value of m=0.003602. The base $\frac{k_n + 1}{k_n}$ can essentially be regarded as a suitably selected constant A.
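Formulas (21) and (22) can be checked numerically with the following sketch. The function and parameter names are assumptions; the clamping of the pause energy to the source pause energy and to the minimum value 2 follows the description.

```python
import math

def pause_correction(k_n=0.01, e_min=2.0, e_max=32768.0):
    """Correction factor m according to formula (22)."""
    exponent = math.log10(e_min) / math.log10(e_max)
    return k_n * ((k_n + 1.0) / k_n) ** exponent - k_n

def pause_quality(e_pa, e_ref_pa=2.0, k_n=0.01, e_min=2.0, e_max=32768.0):
    """Pause quality Q_pa according to formula (21)."""
    # Only energy above that of the source pauses (and above e_min) has an effect.
    e_pa = max(e_ref_pa, e_pa, e_min)
    exponent = math.log10(e_pa) / math.log10(e_max)
    m = pause_correction(k_n, e_min, e_max)
    return -k_n * ((k_n + 1.0) / k_n) ** exponent + k_n + 1.0 + m

m = pause_correction()
```

With the stated constants the correction factor is about 0.003602, the quality is 1 at the minimum pause energy, and it falls as the pause energy rises.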

FIG. 11 represents the relationship between the RMS energy of the signal in the pause phase and $Q_{pa}$.

The quality of the speech phase is determined by the “distance” between the spectra of the source and reception signals.

First, four level windows are defined. Window No. 1 extends from −96.3 dB to −70 dB, window No. 2 from −71 dB to −46 dB, window No. 3 from −46 dB to −26 dB and window No. 4 from −26 dB to 0 dB. Signals whose levels lie in the first window are interpreted as a pause and are not included in the calculation of $Q_{sp}$. The subdivision into four level windows provides multiple resolution. Similar procedures take place in the human ear. It is thus possible to control the effect of interference in the signal as a function of its energy. Window four, which corresponds to the highest energy, is given the maximum weighting.

The distance between the spectrum of the source signal and that of the reception signal in the speech phase for speech frame k and level window i, $Q_{sp}(i,k)$, is calculated in the following way: $Q_{sp}(i,k) = \frac{G_{i,k}\cdot n\cdot\sum_{j=1}^{n}\left(Ex(k)_j - \overline{Ex(k)}\right)\cdot\left(Ey(k)_j - \overline{Ey(k)}\right)}{\sqrt{n\cdot\sum_{j=1}^{n} Ex(k)_j^{2} - \left(\sum_{j=1}^{n} Ex(k)_j\right)^{2}}\cdot\sqrt{n\cdot\sum_{j=1}^{n} Ey(k)_j^{2} - \left(\sum_{j=1}^{n} Ey(k)_j\right)^{2}}}, \qquad (23)$

where Ex(k) is the spectrum of the source signal and Ey(k) the spectrum of the reception signal in frame k. n denotes the spectral resolution of a frame. n corresponds to the number of Bark values in a time frame (e.g. 17). The mean spectrum in frame k is denoted $\overline{E(k)}$. $G_{i,k}$ is the frame- and window-dependent gain constant whose value is dependent on the energy ratio $\frac{Py}{Px}$.

A graphical representation of the G_(i,k) value as a function of the energy ratio is given in FIG. 12.

When the energy in the reception signal is equal to the energy in the source signal (energy ratio equal to 1), G_(i,k) is equal to 1 as well. This has no effect on Q_(sp). All other values lead to a smaller G_(i,k) and hence a smaller Q_(sp), which corresponds to a greater distance from the source signal (lower quality of the reception signal). When the energy of the reception signal is greater than that of the source signal, i.e. $\frac{Py}{Px} > 1$, the gain constant behaves according to the equation:
$$G = 1 - \varepsilon_{HI} \cdot \left( \log_{10}\left( \frac{Py}{Px} \right) \right)^{0.7}.$$

When the energy ratio $\frac{Py}{Px} < 1$, then:
$$G = 1 - \varepsilon_{LO} \cdot \left( \log_{10}\left( \frac{Py}{Px} \right) \right)^{0.7}.$$

The values of ε_(HI) and ε_(LO) for the individual level windows can be found in the table below.

Window No. i   ε_(HI)   ε_(LO)   θ      γ_(SD)
2              0.05     0.025    0.15   0.1
3              0.07     0.035    0.25   0.3
4              0.09     0.045    0.6    0.6

The described gain constant causes extra content in the reception signal to increase the distance to a greater extent than missing content.
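The behaviour of the gain constant can be sketched as below, using the ε values from the table. One assumption is made explicit: for an energy ratio below 1 the logarithm is negative and a real power of 0.7 is not defined, so the sketch uses the magnitude of the logarithm. The function and dictionary names are hypothetical.

```python
import math

# (eps_hi, eps_lo) per level window, taken from the table in the text.
EPSILON = {2: (0.05, 0.025), 3: (0.07, 0.035), 4: (0.09, 0.045)}


def gain_constant(p_y: float, p_x: float, window: int) -> float:
    """Gain constant G_(i,k) for reception energy p_y and source energy p_x (both > 0)."""
    eps_hi, eps_lo = EPSILON[window]
    ratio = p_y / p_x
    # Assumption: the magnitude of log10(Py/Px) is used so that the
    # exponent 0.7 yields a real value when the ratio is below 1.
    log_magnitude = abs(math.log10(ratio))
    eps = eps_hi if ratio > 1.0 else eps_lo
    return 1.0 - eps * log_magnitude ** 0.7
```

For window No. 4, a reception signal with ten times the source energy gives a smaller gain than one with a tenth of the source energy, which reflects the statement that extra content is penalised more than missing content.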

From formula (23) it can be seen that the numerator corresponds to the covariance function and the denominator corresponds to the product of two standard deviations. Thus, for the k-th frame and level window i, the distance is equal to:
$$Q_{sp}(i,k) = G_{i,k} \cdot \frac{\mathrm{Cov}_k(Px, Py)}{\sigma_x(k) \cdot \sigma_y(k)} \qquad (24)$$
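The step from formula (23) to formula (24) can be written out as follows. The abbreviations $x_j = Ex(k)_j$ and $y_j = Ey(k)_j$ are introduced here only for brevity, and the covariance and standard deviations are assumed to be the population (1/n) versions, which the text does not state explicitly:

$$\begin{aligned}
n \cdot \sum_{j=1}^{n} (x_j - \bar{x})(y_j - \bar{y}) &= n^2 \cdot \mathrm{Cov}_k(x, y), \\
n \cdot \sum_{j=1}^{n} x_j^2 - \Big( \sum_{j=1}^{n} x_j \Big)^2 &= n^2 \cdot \sigma_x^2(k), \\
n \cdot \sum_{j=1}^{n} y_j^2 - \Big( \sum_{j=1}^{n} y_j \Big)^2 &= n^2 \cdot \sigma_y^2(k), \\
Q_{sp}(i,k) = G_{i,k} \cdot \frac{n^2 \cdot \mathrm{Cov}_k(x, y)}{n \, \sigma_x(k) \cdot n \, \sigma_y(k)} &= G_{i,k} \cdot \frac{\mathrm{Cov}_k(x, y)}{\sigma_x(k) \cdot \sigma_y(k)}.
\end{aligned}$$

The factors n² cancel, so apart from the gain constant the distance is the correlation coefficient of the two spectra.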

The values θ and γ_(SD) for each level window, which can likewise be seen from the table above, are needed for converting the individual Q_(sp)(i,k) into a single distance measure Q_(sp).

Depending on the content of the signal, three Q_(sp)(i) vectors are obtained whose lengths may differ. In a first approximation, the mean for the respective level window i is calculated as:
$$Q_i = \frac{1}{N} \sum_{j=1}^{N} Q_{sp}(i)_j, \qquad (25)$$

N is the length of the Q_(sp)(i) vector, i.e. the number of speech frames for the respective level window i.

The standard deviation SD_(i) of the Q_(sp)(i) vector is then calculated as:
$$SD_i = \sqrt{\frac{\sum Q_{sp}(i) - \left( \sum Q_{sp}(i) \right)^2}{N}}, \qquad (26)$$

SD describes the distribution of the interference in the coded signal. For burst-like noise, e.g. pulse noise, the SD value is relatively large, whereas it is small for uniformly distributed noise. The human ear also perceives a pulse-like distortion more strongly. A typical case is formed by analogue speech transmission networks such as AMPS.

The effect of how evenly the interference is distributed is therefore modelled in the following way:

Ksd(i) = 1 + SD_(i) · γ_(SD)(i),  (27)

with the following definitions

Ksd(i) = 1, for Ksd(i) > 1 and

Ksd(i) = 0, for Ksd(i) < 0,

and lastly

Qsd_(i) = Ksd(i) · Q_(i).  (28)
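The per-window steps of formulas (25) to (28) can be sketched as below, assuming NumPy. Two points are assumptions rather than statements of the text: the spread is computed with the conventional population standard deviation (`numpy.std`), which stands in for formula (26), and the `sign` parameter is a hypothetical addition, where `sign=1.0` reproduces formula (27) as printed.

```python
import numpy as np


def window_quality(q_sp_values, gamma_sd, sign=1.0):
    """Return Qsd_i for one level window from its per-frame Q_sp(i, k) values."""
    q = np.asarray(q_sp_values, dtype=float)
    q_mean = q.mean()                  # formula (25): mean over the N frames
    sd = q.std()                       # assumption: population standard deviation
    ksd = 1.0 + sign * sd * gamma_sd   # formula (27) as printed when sign = +1
    ksd = min(max(ksd, 0.0), 1.0)      # limits: Ksd = 1 above 1, Ksd = 0 below 0
    return float(ksd * q_mean)         # formula (28)
```

With the plus sign of formula (27) and non-negative SD_(i) and γ_(SD), the limits leave Ksd(i) at 1. Calling the function with `sign=-1.0` is a hypothetical variant, not taken from the text, in which a larger spread lowers the window quality.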

The quality of the speech phase, Q_(sp), is then calculated as the weighted sum of the individual window qualities, according to:
$$Q_{sp} = \sum_{i=2}^{4} U_i \cdot Qsd_i, \qquad (29)$$

The weighting factors U_(i) are determined using

U_(i) = η_(sp) · p_(i),  (30)

η_(sp) being the speech coefficient according to formula 19 and p_(i) corresponding to the weighted degree of membership of the signal to window i, calculated using
$$p_i = \frac{O_i}{\sum_{l=2}^{4} O_l} \quad \text{with} \quad O_i = \frac{N_i}{N_{sp}} \cdot \theta_i.$$

N_(i) is the number of speech frames in window i, N_(sp) is the total number of speech frames and the sum of all θ values is always equal to 1: $\sum_{i=2}^{4} \theta_i = 1.$

That is, the greater the ratio $\frac{N_i}{N_{sp}}$ or θ_(i) is, the more weight the interference in the respective speech frames carries.
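The weighting of formulas (29) and (30) can be sketched as follows. The function name is hypothetical, η_(sp) is passed in as a parameter because formula 19 lies outside this passage, and N_(sp) is assumed to be the sum of the frame counts of windows 2 to 4.

```python
# Theta per level window, taken from the table in the text (sums to 1).
THETA = {2: 0.15, 3: 0.25, 4: 0.6}


def speech_quality(qsd, frame_counts, eta_sp):
    """Return (Q_sp, U) from per-window Qsd_i values and frame counts N_i.

    qsd, frame_counts: dicts keyed by window number 2, 3, 4.
    eta_sp:            speech coefficient (formula 19, supplied by the caller).
    """
    n_sp = sum(frame_counts[i] for i in THETA)  # assumption: N_sp = N_2 + N_3 + N_4
    if n_sp == 0:
        # Assumption: without speech frames there is no speech-phase contribution.
        return 0.0, {i: 0.0 for i in THETA}
    o = {i: frame_counts[i] / n_sp * THETA[i] for i in THETA}
    o_sum = sum(o.values())
    p = {i: o[i] / o_sum for i in THETA}        # weighted degree of membership
    u = {i: eta_sp * p[i] for i in THETA}       # formula (30)
    q_sp = sum(u[i] * qsd[i] for i in THETA)    # formula (29)
    return q_sp, u
```

When all three windows contain the same number of frames, p_(i) reduces to θ_(i), so window No. 4 dominates the result.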

Of course, for a gain constant independent of signal level, the values of ε_(HI), ε_(LO), θ and γ_(SD) can also be chosen as equal for each window.

FIG. 2 represents the corresponding processing segment by the distance measure calculation 16. The quality calculation 17 establishes the value Q_(TOT) (formula 18).

Last of all comes the MOS calculation 5. This conversion is needed in order to be able to represent Q_(TOT) on the correct quality scale. The quality scale with MOS units is defined in ITU-T P.800 “Methods for subjective determination of transmission quality”, 08/96. A statistically significant number of measurements are taken. All the measured values are then represented as individual points in a diagram. A trend curve in the form of a second-order polynomial is then drawn through all the points:

MOS_(o) = a·(MOS_(PACE))² + b·MOS_(PACE) + c  (31)

This MOS_(o) value (MOS objective) now corresponds to the predetermined MOS value. In the best case, the two values are equal.
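Fitting the trend curve of formula (31) amounts to a least-squares fit of a second-order polynomial. The sketch below assumes NumPy and uses `numpy.polyfit`; the function names are hypothetical, and MOS_(PACE) is taken here to be the value delivered by the method before the mapping, which is an assumption since the passage does not define it.

```python
import numpy as np


def fit_mos_mapping(mos_pace, mos_subjective):
    """Least-squares fit of MOS_o = a*MOS_PACE**2 + b*MOS_PACE + c; returns (a, b, c)."""
    # np.polyfit returns coefficients from the highest power downwards.
    a, b, c = np.polyfit(np.asarray(mos_pace, dtype=float),
                         np.asarray(mos_subjective, dtype=float), 2)
    return float(a), float(b), float(c)


def mos_objective(mos_pace, a, b, c):
    """Evaluate formula (31) for one value or an array of MOS_PACE values."""
    x = np.asarray(mos_pace, dtype=float)
    return a * x ** 2 + b * x + c
```

In use, `fit_mos_mapping` would be called once with the measured points (method output against auditory MOS), and `mos_objective` would then convert each new measurement with the stored coefficients.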

The described method can be implemented with dedicated hardware and/or with software. The formulae can be programmed without difficulty. The processing of the source signal is performed in advance, and only the results of the preprocessing and psychoacoustic modelling are stored. The reception signal can, for example, be processed online. In order to perform the distance calculation on the signal spectra, recourse is made to the corresponding stored values of the source signal.

The method according to the invention was tested with various speech samples under a variety of conditions. The length of the samples varied between 4 and 16 seconds.

The following speech transmissions were tested in a real network:

normal ISDN connection.

GSM-FR <−> ISDN and GSM-FR alone.

various transmissions via DCME devices with ADPCM (G.726) or LD-CELP (G.728) codecs.

All the connections were run with different speech levels.

The simulation included:

CDMA Codec (IS-95) with various bit error rates.

TDMA Codec (IS-54 and IS-641) with echo canceller switched on.

Additive background noise and various frequency responses.

Each test consists of a series of evaluated speech samples and the associated auditory judgment (MOS). The correlation obtained between the method according to the invention and the auditory values was very high.

In summary, it may be stated that

the modelling of the time masking,

the modelling of the frequency masking,

the described model for the distance calculation,

the modelling of the distance in the pause phase and

the modelling of the effect of the energy ratio on the quality provided a versatile assessment system correlating very well with subjective perception.

What is claimed is:
 1. Method for making a machine-aided assessment of the transmission quality of audio signals, in particular of speech signals, spectra of a source signal to be transmitted and of a transmitted reception signal being determined in a frequency domain, characterized in that, in order to assess the transmission quality, a spectral similarity value is determined by dividing the covariance of the spectra of the source signal and of the reception signal by the product of the standard deviations of the two spectra and is used in the calculation of transmission quality.
 2. Method according to claim 1, characterized in that the spectral similarity value is weighted with a gain factor which, as a function of a ratio between the energies of the reception and source signals, reduces the similarity value to a greater extent when the energy of the reception signal is greater than the energy in the source signal than when the energy of the reception signal is lower than the energy in the source signal.
 3. Method according to claim 2, characterized in that the gain factor reduces the similarity value as a function of the energy of the reception signal to a greater extent the higher the energy of the reception signal is.
 4. Method according to one of claims 1 to 3, characterized in that inactive phases are extracted from the source and reception signals, and in that the spectral similarity value is determined only for the remaining active phases.
 5. Method according to claim 4, characterized in that, for the inactive phases, a quality value is determined which, as a function of the energy E_(pa) in the inactive phases, essentially has the following characteristic: $A^{\frac{\log_{10}(E_{pa})}{\log_{10}(E_{max})}}.$
 6. Method according to claim 4, characterized in that the transmission quality is calculated by a weighted linear combination of the similarity value of the active phase and the quality value of the inactive phase.
 7. Method according to claim 1, characterized in that before their transformation to the frequency domain, the source and reception signals are respectively divided into time frames in such a way that successive frames overlap to a substantial extent of up to 50%.
 8. Method according to claim 7, characterized in that, in order to perform time masking, the spectrum of a frame has the attenuated spectrum of the preceding frame added to it in each case.
 9. Method according to claim 8, characterized in that, before performing time masking, the components of the spectra are compressed by exponentiation with a value α<1.
 10. Method according to claim 1, characterized in that the spectra of the source and reception signal are each convoluted with a frequency-asymmetric smearing function before determining the similarity value.
 11. Method according to claim 10, characterized in that the components of the spectra are expanded by exponentiation with a value ε>1 before the convolution.